This paper describes a technique for recording pupils' experiences
in the classroom and discusses certain issues related to the
reliability of scores on such records. The instrument is called the
Personal Record of School Experience, or PROSE.

The basic idea behind all versions of OScAR (of which PROSE is
the latest) is that important information about classroom processes
may be derived from simple frequency counts of readily observable elements
of classroom behavior, and that such information can meet the essential
requirements for objective measurement. This is primarily achieved
by separating the observing and recording of behavior on the one hand
from the interpretation and dimensionalization of the records on the
other. In other words, the two main steps in the process of behavior
measurement are performed by different people at different times.

The recorder's function is to see, discriminate, and record
behaviors; it is neither necessary nor desirable for him to have any
very clear idea of the significance or meaning of any behavior he
records. The only judgments or discriminations he needs to make are
those necessary for recognizing which of a set of categories an observed
event best fits into. Objectivity is ensured by defining categories so
that these discriminations are based (1) on relatively obvious and
easily recognized cues, and (2) on cues which are minimally dependent
on sophisticated knowledge or on the observer's own set of values. An
important byproduct of this is that sub-professional personnel can be
trained to do the observations, so that professionally trained
personnel are required only for training and supervision.

Objectivity in the second step of the process of behavior
measurement, interpreting the record, is assured by turning it over to
the machines. Since PROSE records can be "read" and scored by machines,
the only human intervention between the observer's record and the
interpretation is to specify in advance how a record is to be
interpreted; in other words, to specify the scoring key.

Exhibit 1 is a copy of the instrument itself. You will note that
it takes the form of an optical scanning sheet, that is, one which can
be read directly onto computer tape by machine. This form was adopted
partly to reduce the magnitude of the clerical task involved in coding
observational records for analysis, and partly to increase the
objectivity of the measurements ultimately derived from the records.

The PROSE recorder enters the classroom early in the morning
(before the pupils, if possible), carrying a loose-leaf binder and
wearing a compact cassette player over his shoulder. The binder
contains one or more PROSE forms for each pupil he plans to observe,
arranged in the random order in which he plans to observe them. The
cassette player contains a prerecorded tape which emits signals through
an earphone at 25-second intervals.

As soon as the pupils arrive the recorder spots the first child
on his list and watches what is happening to him. As soon as he hears
a signal from the tape, he records whatever event in the child's life
is happening at that time. This continues until five events have
been recorded, 25 seconds apart. Then the recorder pauses to record
the general classroom conditions prevailing during the five events.

The recorder then locates the next child to be observed and
repeats the process, until all the children to be observed have been
seen once. Then he again observes the first child for another cycle
of five events, and repeats the process as often as time permits or
the project requires.
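
The observation routine just described can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
function name, the pupil names, and the use of a seeded shuffle to
stand in for the recorder's prearranged random order are hypothetical,
and offsets are counted from the first signal of each cycle.

```python
import random

SIGNAL_INTERVAL_S = 25   # seconds between tape signals (from the text)
EVENTS_PER_CYCLE = 5     # events recorded per child per cycle (from the text)

def observation_schedule(pupils, rounds, seed=None):
    """Return (round, pupil, event_number, seconds_after_first_signal) tuples.

    Hypothetical helper: the random order is drawn once, before observing
    starts, and the same order is reused on every pass through the class.
    """
    rng = random.Random(seed)
    order = list(pupils)
    rng.shuffle(order)
    schedule = []
    for rnd in range(1, rounds + 1):
        for pupil in order:
            for event in range(EVENTS_PER_CYCLE):
                offset = event * SIGNAL_INTERVAL_S
                schedule.append((rnd, pupil, event + 1, offset))
    return schedule

# Example with made-up pupil names: two passes over three pupils.
sched = observation_schedule(["ann", "bob", "cy"], rounds=2, seed=1)
```

Each pupil contributes five events per pass, so three pupils observed
twice yield thirty recorded events.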

Each event that is recorded is classified on eleven sets of
categories, or words, listed on the "statement" side of the form. Three
basic kinds of events are recognized: peer contacts, adult contacts,
and others. It will be convenient to illustrate the coding process in
terms of an event of the second type, an adult contact.

Suppose, for example, that at the time when a signal is heard, 
the child being observed is sitting quietly on the floor with the rest 
of the pupils in the class listening to a story the teacher is telling. 

The recorder considers the first word, which contains four options.
INIT (initiating) is marked if the pupil is seeking to change the kind
of attention he is receiving from an adult. STAR is marked if an
adult is paying a different kind of attention to the child being
observed than to any other child. PART is marked if the adult is
giving attention to a group of two or more children of whom the child
being observed is one. LSWT (listening or watching) is marked if the
child is attending to an adult who is not paying attention to the child.
The word is omitted when the child is not in contact with any adult.

The kinds of discriminations required of a PROSE recorder are
well illustrated by this word: they are based on overt cues which
demand very little inferring of intent or effects of behavior. Let us
return to our example of the child listening to a story.

The recorder will make a mark after the word PART in column one
on the first word because the child is part of a group to which the
teacher is attending. If the teacher had been talking to this child
only, STAR would have been marked on this word. On the second word, the
recorder marks TCHR (teacher) in the first column to indicate that
the adult with whom the child is in contact is the teacher-in-charge.
On word 3 he would mark ShTL (show or tell) to indicate that the
teacher is showing or telling the children something.

Words 4 and 5 would be left blank because they apply only to peer
contacts. Word 6 would be marked after VRBL (verbal) to indicate
that the contact was verbal. (This tells us that in this case the
teacher is telling, not showing.) And so on with the remaining words,
which record the child's level of attention, race and sex, activity
level, the nature of the task he is performing (if any), and any
manifest affect.

When the observer has considered and appropriately marked each
of the eleven words in column one, he has recorded what we call a
statement about the event in progress when he heard the signal.

This process takes from ten to fifteen seconds, so that when the
next signal is emitted, the recorder can observe the event in
process at that moment.

It is the purpose of the timer and the predetermined random order
for observing pupils to ensure that the events observed will be a
representative sample of those occurring in the classroom.

The context in which each set of five statements was recorded is
coded on the back of the form immediately after the fifth event has
been recorded. It indicates such things as what kind of group the child
was in and his attitude toward it, what the apparent instructional
objectives were and what roles the teacher and each other adult
present played, what materials were used, and where in the room the
child was. The observer also records any of a number of specified
incidents that may have happened during the cycle of five statements
at other times than when the timing signal was given.

The output of each cycle of observations, which takes about five
minutes to obtain, consists of five eleven-word statements describing
five events plus a description of the context in which they were
observed.

The number of different interpretable statements that may be
composed using only the eleven words provided is estimated to be
more than 200,000; when it is considered that the context in which
any statement was made may also vary considerably, it becomes apparent
that there is considerable scope for uniqueness and detail in the
recording of a single event. And yet the task of the recorder is
fairly simple: within the capabilities of a para-professional, for
example.

The basic approach to scoring PROSE records is to specify a
priori a set of statements describing events which would be expected
to occur in the lives of children possessing a certain characteristic
of interest (aggressive children, or dependent children, or children
in a Montessori program, or whatever). Such a specification is in
effect an operational definition of a variable.

The computer is then asked to search each record or set of records
and determine the proportion of all statements in the record or set
that belong to the specified set of statements. That proportion is
the obtained score on the variable in question.
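
The scoring rule lends itself to a short sketch. What follows is an
illustration only, not the program actually used: statements are
represented as dictionaries from word number to the category marked
(an assumed encoding), the key shown is a hypothetical one built from
the story-listening example, and the record is made up.

```python
def prose_score(statements, key):
    """Proportion of statements that belong to the set specified by `key`.

    `statements` is a list of dicts mapping word number -> marked category
    (an assumed encoding); `key` is a predicate defining the specified set.
    """
    if not statements:
        raise ValueError("no statements to score")
    hits = sum(1 for s in statements if key(s))
    return hits / len(statements)

def teacher_tells_group(s):
    """Hypothetical scoring key: the teacher is telling something to a
    group of which the observed child is a member."""
    return (s.get(1) == "PART" and s.get(2) == "TCHR"
            and s.get(3) == "ShTL" and s.get(6) == "VRBL")

# One made-up cycle of five statements for a single child.
record = [
    {1: "PART", 2: "TCHR", 3: "ShTL", 6: "VRBL"},
    {1: "STAR", 2: "TCHR", 3: "ShTL", 6: "VRBL"},
    {1: "PART", 2: "TCHR", 3: "ShTL", 6: "VRBL"},
    {},                        # no adult contact: words left blank
    {1: "INIT", 2: "TCHR"},
]

score = prose_score(record, teacher_tells_group)   # 2 of 5 statements match
```

Changing the variable being measured means changing only the key; the
records themselves are never recoded.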

I would like to use the rest of the time allotted to me to discuss
some issues related to reliability of observations which have come
up in our work with PROSE. We are particularly concerned about
common misuses of coefficients of observer agreement.

Reliability of measurement has to do with how far an obtained
score is likely to be from the true score it is supposed to measure.
What is this true score in the present case? If we imagine that
instead of one recorder observing one child at one time, every
recorder in a population of recorders observed all possible children
at all possible times, then the proportion of specified statements
among all the statements recorded would be the true score in question.
This true score, then, is the mean or expected proportion in the
population of recorders, times, and children.

In a typical observational study, different recorders observe
different children at different times, so the data of the study
usually contain all the information needed for reliability estimation.
Exhibit 2 is an outline of the analysis of variance of a typical set
of data collected to test whether children in different groups
differ on some variable of interest. The exhibit also shows how
various coefficients may easily be estimated from the mean squares
of this analysis.

The common practice of doing a separate reliability study before
collecting the main data of an investigation is wasteful and should
be discontinued. The only purpose such an undertaking can accomplish
is to find out whether a team of recorders has been adequately trained
or not, and it is not well adapted to that purpose, either.

The strategy usually followed is to send observers into the
field in groups to make simultaneous records of whatever behaviors
they may observe, and then to compare the records and calculate
some kind of coefficient of agreement. If the coefficient is too
low, the presumption is that the observers need more training.

This may or may not be true. When two observers record the
same event differently (when they disagree on how a behavior should
be coded), it may be because one or both of them do not know the
category definitions, or it may be because the behavior itself is
ambiguous, that is, contains elements codable in two or more
categories.

Pupils and teachers do not organize their behaviors according
to our categories; many of the things they do belong partly in one
category and partly in another. If we send enough competent
observers to observe such a behavior, some will code it one way,
some another. The proportion of behaviors coded in either category
will be neither zero nor one but somewhere between, which is where
it should be. Brainwashing a team of observers to a point where
they would all code the same behavior in the same category would
lower their accuracy instead of increasing it!
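
The point can be illustrated with a small simulation. It is a sketch
under assumptions that are not part of the PROSE system: the behavior
is supposed to belong 60 percent to a category A and 40 percent to a
category B, and each competent observer is supposed to code it A with
that same probability. The figures and the function name are
hypothetical.

```python
import random

def proportion_coded_a(n_observers, p_a, seed=0):
    """Share of simulated observers who code an ambiguous behavior as A.

    Each observer independently codes A with probability p_a (an assumed
    model of how competent observers treat an ambiguous behavior).
    """
    rng = random.Random(seed)
    codes = ["A" if rng.random() < p_a else "B" for _ in range(n_observers)]
    return codes.count("A") / n_observers

TRUE_MIX = 0.6   # assumed share of the behavior that belongs in category A

free = proportion_coded_a(10_000, TRUE_MIX)   # observers left to disagree
drilled = proportion_coded_a(10_000, 1.0)     # observers drilled to code A

error_free = abs(free - TRUE_MIX)
error_drilled = abs(drilled - TRUE_MIX)
```

Under these assumptions the drilled team agrees perfectly, yet its
recorded proportion lies farther from the assumed mix than that of the
team whose members sometimes disagree.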

The way to find out whether a team of recorders is competent
is to have them all code a set of filmed or videotaped samples of
behavior preselected to contain unambiguous examples of the kinds
of behavior they are supposed to record. Any disagreements found in
such a situation may be taken as evidence of insufficient training,
and near-perfect agreement may be taken as evidence of competence in
the system.

Indices of observer agreement in coding the same behaviors have
very little to do with the reliability of behavior measurements
anyhow. Since the measurements are usually based on composite
frequencies over two or more categories, some observer disagreements
count as agreements on one variable and as disagreements on another.
And in any case, errors due to observer disagreement tend to be negligible
in comparison to errors from other sources, such as the instability
of the behaviors themselves. Indices of observer agreement have a
unique function and their use should be restricted to that function.
In any case they should not be cited as evidence of reliability.
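
A toy example shows how one disagreement can count differently on
different variables. It is a sketch only: the category labels A, B,
and C, the two observers' codes, and both composite variables are
made up for illustration.

```python
def agreement(codes_1, codes_2):
    """Proportion of events that two observers coded identically."""
    return sum(a == b for a, b in zip(codes_1, codes_2)) / len(codes_1)

def composite(codes, categories):
    """Score on a variable defined as the proportion of events coded
    in any of the given categories."""
    return sum(c in categories for c in codes) / len(codes)

# Two observers coding the same six events (made-up data).
obs_1 = ["A", "B", "C", "A", "B", "C"]
obs_2 = ["B", "A", "C", "A", "A", "C"]

event_agreement = agreement(obs_1, obs_2)     # they agree on 3 of 6 events

# On a variable that pools A and B, every A/B disagreement vanishes.
ab_1 = composite(obs_1, {"A", "B"})
ab_2 = composite(obs_2, {"A", "B"})

# On a variable that pools B and C, some of the same disagreements remain.
bc_1 = composite(obs_1, {"B", "C"})
bc_2 = composite(obs_2, {"B", "C"})
```

Here the observers agree on only half of the events, obtain identical
scores on the first variable, and obtain different scores on the
second.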
ESTIMATING THE RELIABILITY OF SCORES BASED ON
OBSERVATIONAL RECORDS

A study is posited in which N individuals divided into G
groups are observed by R recorders a total of V times each.
Each recorder observes each individual at least once, and
observes every individual the same number of times. This number
may vary across recorders: one recorder may see all individuals
twice; another may see them three times each. No two recorders
observe the same individual at the same time.

The analysis of variance below is the one appropriate for
testing for differences in behavior of individuals in different
groups, and the formulas show how various coefficients may be
estimated from the mean squares obtained in the analysis.



ANALYSIS OF VARIANCE 
Reliability (Estimates the correlation between the set of scores
actually obtained and a similar set obtained by different recorders
observing the individuals at different times)

    r = (b - f)/b
    r = (a - b - e + g)/a

Stability (Estimates the correlation between the set of scores
actually obtained and a similar set obtained by the same recorders
observing the individuals at different times)

    r = (b - h)/b
    r = (a - b + f - g)/a

Observer Agreement (Estimates the correlation between the set of
scores actually obtained and a similar set obtained by different
recorders observing the individuals at the same times)
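
The reliability and stability formulas above can be applied
mechanically, as in the following sketch. It assumes that the letters
a, b, e, f, g, and h stand for mean squares from the analysis of
variance; which mean square each letter denotes, and which case each
of the two formulas under a coefficient applies to, are not specified
here, so the function names are neutral and the numerical values are
made up.

```python
def reliability_first(b, f):
    """First reliability formula: r = (b - f)/b."""
    return (b - f) / b

def reliability_second(a, b, e, g):
    """Second reliability formula: r = (a - b - e + g)/a."""
    return (a - b - e + g) / a

def stability_first(b, h):
    """First stability formula: r = (b - h)/b."""
    return (b - h) / b

def stability_second(a, b, f, g):
    """Second stability formula: r = (a - b + f - g)/a."""
    return (a - b + f - g) / a

# Hypothetical mean squares, chosen only to exercise the arithmetic.
ms = {"a": 40.0, "b": 10.0, "e": 4.0, "f": 2.5, "g": 2.0, "h": 2.0}

rel_1 = reliability_first(ms["b"], ms["f"])                      # 7.5 / 10
rel_2 = reliability_second(ms["a"], ms["b"], ms["e"], ms["g"])   # 28 / 40
stab_1 = stability_first(ms["b"], ms["h"])                       # 8 / 10
stab_2 = stability_second(ms["a"], ms["b"], ms["f"], ms["g"])    # 30.5 / 40
```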